
[Hackathon No.38] Optimize the GPU performance of the deformable_conv op in Paddle #218

Merged: 6 commits into PaddlePaddle:master on Sep 7, 2022

Conversation

Rayman96
Contributor

@Rayman96 Rayman96 commented Aug 23, 2022

The proposal has been revised; thanks for reviewing.

@paddle-bot

paddle-bot bot commented Aug 23, 2022

Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

@@ -0,0 +1,91 @@
Design document for Poisson OP performance optimization

Please fix the title.

Contributor Author

Fixed.


# 1 Background and Significance

The current GPU version of deformable_conv in Paddle is implemented with cuBLAS plus custom CUDA kernels. Like the original paper authors' implementation, the kernels essentially port the CPU kernels to the GPU, and the CUDA code has not been specifically optimized.

Typo: 讲 -> 将

Contributor Author

Fixed.

To survey the current performance of this OP in the Paddle framework (develop branch), the table lists OP performance data (Tesla P4) for the various cases in [OP Benchmark](https://github.com/PaddlePaddle/benchmark/tree/master/api/tests_v2).

### Timing Analysis
Running deformable_conv through the benchmark executes both the forward and backward passes, so the two need to be separated when attributing time; the table below lists the main components.

Actually, when running the OP benchmark you can just set --backward to False; the data then covers only the forward pass, which is easier to read.

Contributor Author

Got it, good to know 🙏 Data updated.


From the kernel running-time analysis: 65% of the time is spent on No.5, and the cuBLAS implementation itself leaves little room for major optimization, so optimizing the two kernels individually is unlikely to reach the target.

From the CUDA API time analysis: 63% of the time is spent on synchronization and 26% on memory allocation. Reducing the number of thread synchronizations and cutting down data movement between host memory and the CUDA device should therefore yield substantial gains, so optimization points 1 and 2 are the primary targets. Overall, execution is serial: each im2col must finish before its gemm runs, and only then does the next im2col/gemm pair begin (sketched in the snippet below).

By "data movement between memory and CUDA", do you mean host-to-device copies?
Could you also add the API timing measurements here?

Contributor Author

Yes, it refers to host-to-device data copies.
When only the forward pass is measured, the API time distribution changes; I've added that to the timing analysis section and revised the description here. The API-level data alone doesn't show any clear conclusion.
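The serial flow described above can be paraphrased roughly as follows (an illustrative sketch, not the exact Paddle code; the slice helpers and launch parameters are hypothetical):

```cpp
// Illustrative paraphrase of the current serial flow: each iteration
// must finish its im2col kernel and its gemm before the next starts,
// and every gemm adds a host-side synchronization point.
for (int i = 0; i < batch_size / im2col_step; ++i) {
  // Custom CUDA kernel: expand this step's deformable samples into the
  // shared column buffer (one thread per column element).
  ModulatedDeformableIm2colGpuKernel<<<grid, block, 0, stream>>>(
      input_slice(i), offset_slice(i), mask_slice(i), col_buffer);
  // Host-side cuBLAS call through Paddle's blas wrapper:
  // output_3d[i] = weight * col_buffer.
  blas.MatMul(weight, col_buffer, &output_3d[i]);
}
```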

+ Optimization point 1: search for a better grid/block launch configuration.
+ Optimization point 2: move the pixel-weight product loop from deformable_conv_kernel_impl into ModulatedDeformableIm2colGpuKernel, fusing the parallel computation of col_buffer with the computation of output_3d to cut part of the data-movement overhead.
+ Optimization point 3: parallelize the (batch_size / im2col_step) loop in deformable_conv_kernel_impl; currently each im2col_step must finish before the next step starts, and this waiting is unnecessary.
+ Optimization point 4: separately optimize the pixel-weight product loop in deformable_conv_kernel_impl.

Please make the end-of-sentence punctuation consistent.

Contributor Author

Fixed.

## 2.2 Host/Device Computation Flow
1. For optimization point 1: obtain a better launch configuration via the GetGpuLaunchConfig1D method in Paddle's existing gpu_launch_config.h, or manually benchmark different BlockSize values (some optimization headroom expected).

2. For optimization point 2: pass two extra parameters into the host side of ModulatedDeformableIm2colGpuKernel, and have the device side go on to compute output after finishing col_buffer (potentially large headroom; a sketch follows this item's discussion).

Could you describe this in more detail, e.g. with a diagram or pseudocode? This optimization seems to be the key one in the description above.

Contributor Author

Pseudocode description added.
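A minimal sketch of what the fused kernel in optimization point 2 could look like (heavily simplified and hypothetical; the real Paddle kernel's bilinear sampling, indexing, and signature differ):

```cpp
#include <cuda_runtime.h>

// Hypothetical fused kernel (illustrative names, not the real Paddle
// signature). Each thread produces one col_buffer element exactly as
// the im2col kernel does today, then immediately multiplies it by the
// matching filter weight and accumulates into output, instead of
// leaving that product to a separate host-launched gemm.
__global__ void FusedDeformableIm2colKernel(
    const float* input,   // stand-in for the bilinearly sampled values
    const float* weight,  // extra parameter 1: filter weights (size K)
    float* col_buffer,    // still written, so backward can reuse it
    float* output,        // extra parameter 2: fused result (size N)
    int kernel_size,      // K: column elements per output position
    int num_outputs) {    // N: number of output positions
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= kernel_size * num_outputs) return;

  // In the real kernel this value comes from bilinear sampling with
  // the learned offsets and modulation mask; read directly for brevity.
  float val = input[idx];
  col_buffer[idx] = val;

  // Fused "gemm" step: column element (k, n) contributes
  // weight[k] * val to output[n]; atomicAdd performs the reduction
  // over k that cuBLAS would otherwise do.
  int k = idx / num_outputs;
  int n = idx % num_outputs;
  atomicAdd(&output[n], weight[k] * val);
}
```

The trade-off is atomicAdd contention on output versus saving the separate gemm launch and its synchronization; the PR's actual pseudocode may resolve the reduction differently.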



3. For optimization point 3: parallelize the whole im2col_step process into a new kernel flow covering both the im2col and gemm steps (potentially large headroom; a sketch follows this item's discussion).

Same as above.

Contributor Author

Pseudocode description added.
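A hedged host-side sketch of optimization point 3, under assumed and simplified signatures (raw cuBLAS is used here for illustration where Paddle would go through its own blas wrapper): all im2col kernels are launched asynchronously first, each writing its own slice of an enlarged col_buffer, and the per-step gemm loop is replaced by one strided batched gemm.

```cpp
#include <cublas_v2.h>

// Assumed forward declaration of the existing Paddle kernel.
__global__ void ModulatedDeformableIm2colGpuKernel(
    const float* input, const float* offset, const float* mask,
    float* col_buffer);

void ParallelForwardSketch(cublasHandle_t handle, cudaStream_t stream,
                           const float* input, const float* offset,
                           const float* mask, const float* weight,
                           float* col_buffer, float* output,
                           int num_steps, int m, int n, int k,
                           int in_stride, int col_stride, int out_stride) {
  // 1. Launch every step's im2col back to back on one stream; the
  //    launches are asynchronous, so there is no host-side wait
  //    between steps (offset/mask strides simplified to in_stride).
  int threads = 256;
  int blocks = (col_stride + threads - 1) / threads;
  for (int i = 0; i < num_steps; ++i) {
    ModulatedDeformableIm2colGpuKernel<<<blocks, threads, 0, stream>>>(
        input + i * in_stride, offset + i * in_stride,
        mask + i * in_stride, col_buffer + i * col_stride);
  }
  // 2. One batched gemm computes output_i = weight (m x k) times
  //    col_buffer_i (k x n) for all steps at once, replacing num_steps
  //    separate cuBLAS calls. Row-major data is handled with the usual
  //    swapped-operand trick for column-major cuBLAS.
  const float alpha = 1.0f, beta = 0.0f;
  cublasSetStream(handle, stream);
  cublasSgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                            n, m, k, &alpha,
                            col_buffer, n, col_stride,     // B_i
                            weight, k, 0,                  // A shared
                            &beta, output, n, out_stride,  // C_i
                            num_steps);
}
```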


## 3 Testing and Acceptance Criteria

Achieve a forward-pass speedup of more than 25%.

Could you explain the reasoning behind the estimated speedup of more than 25%?

Contributor Author

Fixed.

@Rayman96
Contributor Author

Is there currently a way to use dynamic parallelism in Paddle? I'd like to call blas to compute matrix multiplication inside a GPU kernel.
When I call blas = phi::funcs::GetBlas directly in the kernel, I get the error "calling a host function from a global function is not allowed".

@ZzSean

ZzSean commented Sep 1, 2022

> Is there currently a way to use dynamic parallelism in Paddle? I'd like to call blas to compute matrix multiplication inside a GPU kernel. When I call blas = phi::funcs::GetBlas directly in the kernel, I get the error "calling a host function from a global function is not allowed".

cuBLAS functions are all called from the host side; internally you can think of them as launching CUDA kernels themselves, so you cannot call a global function from within another global function.
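In other words, the workable pattern keeps the gemm on the host side, right after the custom kernel launch. A minimal sketch under that assumption (the kernel and tensor arguments are placeholders; the GetBlas/MatMul usage follows the common phi pattern but is not verified against this PR):

```cpp
#include "paddle/phi/backends/gpu/gpu_context.h"
#include "paddle/phi/kernels/funcs/blas/blas.h"

// Hedged sketch of the pattern described above: both the custom
// __global__ kernel and the cuBLAS call are issued from host code on
// the same stream, one after the other; no device-side
// (dynamic-parallelism) launch is needed.
template <typename T>
void ForwardSketch(const phi::GPUContext& dev_ctx,
                   const phi::DenseTensor& weight,
                   const phi::DenseTensor& col_buffer,
                   phi::DenseTensor* output) {
  // 1. Host launches the custom im2col kernel (asynchronous), e.g.:
  //    MyIm2colKernel<<<grid, block, 0, dev_ctx.stream()>>>(...);

  // 2. Host then calls cuBLAS through the phi blas wrapper. This also
  //    just enqueues CUDA kernels on the stream, which is why it
  //    cannot be called from inside a __global__ function.
  auto blas = phi::funcs::GetBlas<phi::GPUContext, T>(dev_ctx);
  blas.MatMul(weight, col_buffer, output);
}
```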

@ZzSean ZzSean left a comment

LGTM

@ZzSean ZzSean merged commit cc3d3fc into PaddlePaddle:master Sep 7, 2022